Implement SparseK Attention mechanism — new GGML operator with CPU backend (GPU planned next) #16817
Conversation
Hi @CISC and @NeoZhangJianyu, we'd appreciate it if you could review our PR implementing the new SparseK Attention operator. This contribution was developed jointly by both of us (@yael-works and @GittyBurstein). Thanks in advance for your time and feedback!

We are talking about this SparseK, right?

Yes! @CISC
You need to rebase to fix the Server CI failures; please also fix the whitespace issues.
Hi @CISC, I'd really appreciate it if you could review the code itself so we can move forward with the merge. Thanks!
Yes, as mentioned, it will be resolved if you rebase, it's ok. :)

So, my main challenge is where/what/when will SparseK be used? I can't recall seeing any actual implementation being used in the wild. This also means we don't really have any reference to test it against...
@CISC Once this PR is merged, the operator can be connected to higher-level use cases. Thank you!!
I think @ggerganov will have to weigh in on this.
New Attention Mechanism: SparseK Attention (CPU Backend)
This PR introduces a new attention mechanism called SparseK Attention, implemented from scratch as a new operator within the GGML framework, currently with CPU backend support.
Overview
SparseK Attention is a selective and efficient attention mechanism inspired by Flash Attention, but it introduces additional sparsity through:
- per-query top-k key selection (`k_top`)
- a local attention window around each query (`win_local`)
- strided global keys (`stride_global`)
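To make the sparsity pattern concrete, below is a minimal, hypothetical sketch (not code from this PR) of a predicate deciding whether key position j is structurally visible to query position i under `win_local` and `stride_global`. The `k_top` component depends on the attention scores, so it is selected at run time rather than by a static rule like this:

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical illustration of SparseK's structural sparsity, assuming causal
// attention: a key position j is kept for query position i when it lies inside
// the local window or on a global stride. This is NOT the PR's code, only a
// sketch of the idea behind the win_local / stride_global parameters.
static bool sparsek_structurally_visible(int64_t i, int64_t j,
                                         int64_t win_local, int64_t stride_global) {
    if (j > i) {
        return false;                                   // causal mask (assumption)
    }
    if (i - j < win_local) {
        return true;                                    // local attention window
    }
    if (stride_global > 0 && j % stride_global == 0) {
        return true;                                    // periodic "global" key
    }
    return false;                                       // masked unless chosen by top-k
}
```

Keys outside the window and stride would then only contribute if they rank among the `k_top` highest-scoring keys for that query.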
Implementation Details
- New operator `GGML_OP_SPARSEK_ATTN` defined in `ggml.h` and `ggml.c`.
- New API function `ggml_sparsek_attn()` that creates a computation node with the parameters (`k_top`, `win_local`, `stride_global`).
- CPU backend implementation in `ggml-cpu/ops.h`, `ggml-cpu/ops.cpp`, and `ggml-cpu.c`.

The CPU version includes computation of the scaled attention scores QKᵀ / √d.
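As a quick orientation for reviewers, here is a hypothetical usage sketch showing how the new node could be built into a GGML graph. The exact signature of `ggml_sparsek_attn()` (Q/K/V tensors plus the three integer parameters) and the tensor shapes are assumptions for illustration; the declaration in `ggml.h` from this PR is authoritative:

```c
#include "ggml.h"

// Hypothetical usage sketch only. The signature of ggml_sparsek_attn() is an
// assumption (Q, K, V plus k_top / win_local / stride_global); see ggml.h in
// this PR for the actual declaration.
int main(void) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 128*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    const int64_t d_head = 64, n_tokens = 512, n_heads = 8;   // illustrative shapes

    struct ggml_tensor * q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_head, n_tokens, n_heads);
    struct ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_head, n_tokens, n_heads);
    struct ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_head, n_tokens, n_heads);

    // Illustrative SparseK parameters: keep the 32 highest-scoring keys per query,
    // a local window of 128 tokens, and every 64th token as a global key.
    struct ggml_tensor * out = ggml_sparsek_attn(ctx, q, k, v,
                                                 /*k_top        =*/ 32,
                                                 /*win_local    =*/ 128,
                                                 /*stride_global=*/ 64);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);

    // ... fill q/k/v with data and evaluate the graph with the CPU backend as usual ...

    ggml_free(ctx);
    return 0;
}
```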
Next Steps
Our next goal is to extend SparseK Attention to the SYCL (GPU) backend.
We are submitting this initial CPU implementation first to ensure review, integration, and baseline correctness before introducing GPU acceleration.
Co-Authors
Co-authored-by: Yael Shuker ([email protected])
Co-authored-by: Gitty Burstein ([email protected])